Harvesting Entities from the Web Using Unique Identifiers - IBEX

Authors

  • Aliaksandr Talaika
  • Joanna Biega
  • Antoine Amarilli
  • Fabian M. Suchanek
Abstract

In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extraction of identifiers and names from Web pages, we show how we can use the properties of unique identifiers to filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73–96% and a very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.
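The abstract does not say which properties of identifiers are used for filtering. One property that ISBNs and GTINs are known to have is a check digit, so the following is only an illustrative sketch of how candidates pulled from page text could be filtered by checksum. The regular expression and all function names are assumptions made for this example, not the authors' method.

```python
import re

# Hypothetical candidate pattern (an assumption for this sketch):
# 13 ASCII digits, optionally separated by single hyphens or spaces.
CANDIDATE_13 = re.compile(r"\b[0-9](?:[- ]?[0-9]){12}\b")


def gtin13_is_valid(digits: str) -> bool:
    """Check the mod-10 check digit of a 13-digit GTIN (also used by ISBN-13)."""
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == int(digits[12])


def isbn10_is_valid(chars: str) -> bool:
    """Check the mod-11 check character of an ISBN-10 (last one may be 'X')."""
    if len(chars) != 10 or not chars.isascii():
        return False
    if not chars[:9].isdigit() or chars[9] not in "0123456789X":
        return False
    values = [int(c) for c in chars[:9]]
    values.append(10 if chars[9] == "X" else int(chars[9]))
    # Weights run 10, 9, ..., 1; the weighted sum must be divisible by 11.
    return sum((10 - i) * v for i, v in enumerate(values)) % 11 == 0


def extract_valid_gtin13(text: str) -> list[str]:
    """Return normalized 13-digit candidates from text that pass the checksum."""
    found = []
    for match in CANDIDATE_13.finditer(text):
        digits = re.sub(r"[- ]", "", match.group())
        if gtin13_is_valid(digits):
            found.append(digits)
    return found
```

For example, `extract_valid_gtin13("ISBN 978-0-306-40615-7, not 978-0-306-40615-8")` keeps only the first candidate, because the second fails the check-digit test. A checksum only rejects malformed strings; it cannot tell whether a well-formed number actually refers to an entity, so it would be at most one filtering step among several.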

Similar resources

Harvesting Entities from the Web Using Unique Identifiers – IBEX / Extraction des entités du Web à l’aide d’identifiants uniques – IBEX

In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extract...

Harvesting Entities from the Web Using Unique Identifiers

In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extracti...

From Web Data to Entities and Back

We present the Entity Name System (ENS), an enabling infrastructure that can host descriptions of named entities and provide unique identifiers at large scale. In this way, it opens new perspectives for realizing entity-oriented, rather than keyword-oriented, Web information systems. We describe the architecture and the functionality of the ENS, along with tools, which all contribute to realize...

Harvesting Wiki Consensus - Using Wikipedia Entries as Ontology Elements

One major obstacle towards adding machine-readable annotation to existing Web content is the lack of domain ontologies. While FOAF and Dublin Core are popular means for expressing relationships between Web resources and between Web resources and literal values, we widely lack unique identifiers for common concepts and instances. Also, most available ontologies have a very weak community groundi...

Organizing Thematic, Geographic, and Temporal Knowledge in a Well-Founded Navigation Space: Logical and Algorithmic Foundations for EFGT Nets

We introduce a family of symbolic logical formalisms for reasoning with named entities, associated topics or thematic fields, geographic areas, and temporal periods. We argue that this kind of knowledge is useful for various applications in a Semantic Web context, in other words, for the content-oriented description of Web services and yellow pages. In our approach, entities and their relations...

Journal:
  • CoRR

Volume: abs/1505.00841

Publication date: 2015